A procedure for extracting bibliographic author metadata from scholarly texts is presented. Author
segments are identified based on capitalization and line break patterns. Two main author layout
templates, which can retrieve authors from a varied set of title pages, are provided. Additionally,
several disambiguating rules are described.



I. INTRODUCTION 

Several recognition methods that target the extraction of bibliographical metadata from scholarly texts have been
described. Based on their underlying methodology, they are formally grouped into two categories. One, named knowledge
representation and template mining, uses prior pattern knowledge to segment texts and to retrieve data [..., 3, ...].
The other, which is based on general machine learning techniques, applies statistical devices, concretely hidden
Markov models, support vector machines [...], and conditional random fields [...], to infer segmentations. The
knowledge of structural patterns is thus automatically embedded into transition and conditional probability matrices
that are precomputed from sample sets.

Practical implementations of these methods, not only the knowledge representation ones but, to some extent, also the
machine learning ones, include a priori information on textual patterns. Described patterns are font size and
typeface variations, HTML markup and delimiters, relative length of title versus author segments, predefined segment
ordering, and textual punctuation marks. Capitalization and line breaks, although noticed in references [3, ...],
have eluded much of the attention, and their potential has not been fully exploited.

This article focuses on the extraction of the author field. The presented procedure is based on the identification
of simple, yet general, templates or regular expressions. Author segments are recognized solely from capitalization
patterns and line break delimiters. Section II describes a minimalist text encoding or tokenization, and gives the
corresponding author templates. Section III analyzes layout templates. Both author and layout templates emerge
from the examination of plain text conversions of 2350 title pages. Interestingly, scholarly text layouts constrain
author segments to only two main templates. This is in sharp contrast with general proper name grammars, which
consist of over a hundred rules [...]. Section IV provides several disambiguating rules, and describes the use of a
common name prefix lexicon that particularizes the procedure to specific domains. Section V gives the details of the
data set and other implementation details.



II. TEXT ENCODING AND AUTHOR NAME TEMPLATES 

Words, which are defined here as contiguous sequences of two or more letters, are mapped to a single symbol code.
Different symbols are assigned based on whether all letters are uppercase, only the first letter is uppercase, or at
least the first letter is lowercase. Additionally, initials, line breaks, relevant punctuation marks, and some special,
high-frequency words have a particular code. Spaces are only meaningful to discern word boundaries and are omitted
in the encoded string. The complete code mapping is listed in Table I.
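The encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the program's actual implementation. Several details are assumptions: the code `o` for "Other" is hypothetical (that symbol is not legible in Table I), the adparticle list is taken from Section IV, capitalized adparticles are coded like lowercase ones, hyphens are treated as letters, and single lowercase letters are coded as "Other".

```python
import re

# Adparticles as listed in Section IV; "and" has its own code "&".
ADPARTICLES = {"of", "the", "in", "for", "from", "with", "to", "on"}
PUNCTUATION = {".": "P", ",": ",", ";": ";", ":": ":", "&": "&"}
OTHER = "o"  # hypothetical code: the symbol for "Other" is not legible in Table I

# A token is a run of letters (hyphen counted as a letter) or one non-space symbol.
TOKEN = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*|\S")

def encode_token(tok):
    """Map one token to its single-symbol code."""
    low = tok.lower()
    if low == "and":
        return "&"
    if low in ADPARTICLES:          # assumption: case-insensitive lookup
        return "a"
    if tok[0].isalpha():
        if len(tok) == 1:           # single letter: initial if uppercase
            return "I" if tok.isupper() else OTHER
        if tok.isupper():
            return "N"              # all uppercase
        if tok[0].isupper():
            return "n"              # first letter uppercase
        return "w"                  # at least first letter lowercase
    return PUNCTUATION.get(tok, OTHER)

def encode(text):
    """Encode a text; spaces are dropped and line breaks become 'L'."""
    lines = text.strip().splitlines()
    coded = ["".join(encode_token(t) for t in TOKEN.findall(line)) for line in lines]
    return "L" + "L".join(coded) + "L"

print(encode("Philosophiæ Naturalis Principia Mathematica\nIsaac Newton"))  # LnnnnLnnL
```

The leading and trailing `L` mirror the encoded string of the example given in this section, where every line is delimited by line break codes.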

This encoding or tokenization, besides simplifying string matching, highlights the author patterns. The text 

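Philosophiæ Naturalis Principia Mathematica
Isaac Newton

for example, produces the encoded string $\lfloor LnnnnLnnL \rfloor$, from which segment $\lfloor nn \rfloor$ is identified as an author pattern.
The inspection of the complete title page data set produces the following author name templates

$$A_l = \lfloor (?: n\mathcal{I}n | n\mathcal{I}\mathcal{I}n | \mathcal{I}n | \mathcal{I}nn | \mathcal{I}\mathcal{I}n | \mathcal{I}\mathcal{I}\mathcal{I}n | n\mathcal{I}nn | nn\mathcal{I}n | \mathcal{I}\mathcal{I}nn | nn | nnn ) \rfloor , \tag{1}$$

and

$$A_u = \lfloor (?: [nN]\mathcal{I}N | [nN]\mathcal{I}\mathcal{I}N | \mathcal{I}N | \mathcal{I}[nN]N | \mathcal{I}\mathcal{I}N | \mathcal{I}\mathcal{I}\mathcal{I}N | [nN]\mathcal{I}[nN]N | [nN]N\mathcal{I}N | \mathcal{I}\mathcal{I}[nN]N | [nN]N | [nN][nN]N ) \rfloor , \tag{2}$$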
Code   Name         Description

N      Name         All uppercase
n      Name         First letter uppercase
I      Initial      Single, uppercase letter
w      Word         At least first letter lowercase
P      Period
,      Comma
;      Semicolon
:      Colon
&      And          Conjunction 'and'
L      Line         Line break
a      Adparticle   Special particles that precede words
       Other        Any other symbol except space

TABLE I: Coding scheme. Word boundaries are identified by non-letter symbols and spaces. Spaces are omitted in the encoded
string.



with initial $\mathcal{I}$ being

$$\mathcal{I} = \lfloor IP\{0,1\} \rfloor . \tag{3}$$

Exceptions to these patterns are one single-letter last name and two four-word names out of approximately ten thousand
names. The relatively frequent cases of hyphenated names, and of names with personal articles, prepositions, and
qualifiers, do match the patterns by simply reclassifying the hyphen as a letter, and annexing personal particles in
a preprocessing step. While reversed order naming is frequent in references and databases, only natural ordering
appears in the title page data set.
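As an illustration, the two author name templates can be written as ordinary regular expressions over the encoded string. The sketch below assumes the period code `P` of Table I, and uses a lowercase `i` purely as a placeholder for the initial (an initial code optionally followed by a period). The names `INITIAL`, `template`, and `is_author` are hypothetical helpers, not part of the original program.

```python
import re

INITIAL = "(?:IP?)"  # an initial, optionally followed by a period

# Alternatives of the lowercase and uppercase templates; "i" marks an initial.
LOWER = ["nin", "niin", "in", "inn", "iin", "iiin",
         "ninn", "nnin", "iinn", "nn", "nnn"]
UPPER = ["[nN]iN", "[nN]iiN", "iN", "i[nN]N", "iiN", "iiiN",
         "[nN]i[nN]N", "[nN]NiN", "ii[nN]N", "[nN]N", "[nN][nN]N"]

def template(alternatives):
    """Build a non-capturing alternation, expanding the initial placeholder."""
    return "(?:" + "|".join(a.replace("i", INITIAL) for a in alternatives) + ")"

A_LOWER = re.compile(template(LOWER))
A_UPPER = re.compile(template(UPPER))

def is_author(segment):
    """True if an encoded segment is, as a whole, one author name pattern."""
    return bool(A_LOWER.fullmatch(segment) or A_UPPER.fullmatch(segment))

print(is_author("nn"))     # True, e.g. "Isaac Newton"
print(is_author("IPIPn"))  # True, e.g. "I. I. Newton"
print(is_author("nnnn"))   # False, four-word names are exceptions
```

Using `fullmatch` on an isolated segment avoids depending on the order of the alternatives, since shorter alternatives such as `nn` precede longer ones such as `nnn`.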

III. AUTHOR LAYOUT TEMPLATES 

Once the text content is extracted from a document, any stylistic layout, except capitalization and line breaks, is
stripped. The bare author layout then emerges. Two layouts are found, the single block of authors

$$B_s = \lfloor A([,;\&L]+A)*(?=L) \rfloor , \tag{4}$$

and the multiple block, with the author lines $B_s$ being followed, possibly, by the author addresses

$$\mathcal{L} = \lfloor (L[\text{\textasciicircum}L]*) \rfloor . \tag{5}$$

The generic author template $A$ stands for $A_l$ and $A_u$. The two are applied sequentially. A three-block layout, for
instance, will be

$$B_3 = \lfloor (B_s)\mathcal{L}\{0,7\}(B_s)\mathcal{L}\{0,7\}(B_s) \rfloor . \tag{6}$$

The set of author $A$ and layout $B$ templates is sufficient to correctly extract 83% of the authors in the data
set. To extract the remainder, the patterns $\lfloor LnnL \rfloor$ and $\lfloor LnnnL \rfloor$, and the corresponding all-uppercase cases, must be
disambiguated.
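The single-block layout can be tried out with the following minimal Python sketch. It assumes the period code `P` of Table I and a lowercase `i` as a placeholder for the initial. The function names are hypothetical, and the sequential application of the two author templates is rendered here simply as "try the lowercase layout first, then the uppercase one".

```python
import re

INITIAL = "(?:IP?)"  # an initial, optionally followed by a period

LOWER = ["nin", "niin", "in", "inn", "iin", "iiin",
         "ninn", "nnin", "iinn", "nn", "nnn"]
UPPER = ["[nN]iN", "[nN]iiN", "iN", "i[nN]N", "iiN", "iiiN",
         "[nN]i[nN]N", "[nN]NiN", "ii[nN]N", "[nN]N", "[nN][nN]N"]

def template(alternatives):
    """Build an author template, expanding the initial placeholder."""
    return "(?:" + "|".join(a.replace("i", INITIAL) for a in alternatives) + ")"

def single_block(author):
    """Authors separated by ',', ';', 'and' or line breaks, ending before a line break."""
    return re.compile(f"{author}(?:[,;&L]+{author})*(?=L)")

B_LOWER = single_block(template(LOWER))
B_UPPER = single_block(template(UPPER))

def find_author_block(encoded):
    """Return the first encoded single block of authors, or None."""
    for layout in (B_LOWER, B_UPPER):  # the two templates are tried in turn
        match = layout.search(encoded)
        if match:
            return match.group(0)
    return None

# "On the motion of bodies in orbit" / "Isaac Newton, Edmond Halley and Robert Hooke"
print(find_author_block("LaawawawLnn,nn&nnL"))  # nn,nn&nn
```

A title line made only of capitalized words, such as one encoded as `LnnL`, would match as well, which is the ambiguity the text refers to.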

IV. ADPARTICLES AND LEXICONS 

Most of the $\lfloor LnnL \rfloor$ and $\lfloor LnnnL \rfloor$ patterns in the text that are not actual proper names are parts of capitalized titles
spanning multiple lines. Others are publisher's tags, such as Open Access, and a few of them are addresses. An analysis
of over one hundred thousand titles has identified the most frequent words as being of, the, and, in, for, from, with, to,
and on. On average, the preposition of appears in every title, and on in one of every ten titles. These high-frequency words
are encoded as $\lfloor a \rfloor$, and are referred to here as adparticles, meaning that they are connected to, or connect, other words.
Example                                 Template

Nonlinear Theory of                     $S_1 = \lfloor aL+[nN]\{1,2\} \rfloor$
Shallow Shells

... in Linear and                       $S_2 = \lfloor a[nNw]\&L+[nN]\{1,2\} \rfloor$
Sublinear Time

... and Potentials:                     $S_3 = \lfloor :L+[nN]\{1,2\} \rfloor$
Concept Elaboration

Unrestricted Hartree-Fock Then and      $S_4 = \lfloor [nN]*\&L[nN]L \rfloor$
Now

TABLE II: Escape templates. Examples of ambiguous cases of capitalized common names, and possible templates to safely
escape them.



Lexically, adparticles would be adpositions, adverbs, articles, conjunctions, coverbs, prepositions, or some verbal forms
such as using. Adparticles possess, therefore, the desirable property of allowing subsequent words to be safely escaped.
Adparticle escape templates are listed in Table II, together with a case example.
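One possible way to apply the escape templates is sketched below in Python. The text does not specify how escaping is carried out, so the mechanism here is an assumption: the name codes that follow the line break are recoded as plain words (`w`), which prevents them from matching any author template. The list `ESCAPES` and the function `escape` are hypothetical names.

```python
import re

# Each pattern has a context group and a group with the name codes to escape.
ESCAPES = [
    re.compile(r"(aL+)([nN]{1,2})"),        # S1: adparticle, line break, names
    re.compile(r"(a[nNw]&L+)([nN]{1,2})"),  # S2: adparticle, word, 'and', line break
    re.compile(r"(:L+)([nN]{1,2})"),        # S3: colon, line break, names
    re.compile(r"([nN]*&L)([nN])(?=L)"),    # S4: 'and', line break, single name line
]

def escape(encoded):
    """Recode ambiguous capitalized words as 'w' (assumed escaping mechanism)."""
    for pattern in ESCAPES:
        encoded = pattern.sub(lambda m: m.group(1) + "w" * len(m.group(2)), encoded)
    return encoded

print(escape("LnnaLnnL"))  # LnnaLwwL  (Nonlinear Theory of / Shallow Shells)
print(escape("Lnnn&LnL"))  # Lnnn&LwL  (Unrestricted Hartree-Fock Then and / Now)
```

The fourth pattern uses a lookahead for the closing line break instead of consuming it, so that the line break remains available to whatever is matched next.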

Additionally, a common name lexicon has been compiled to disambiguate cases such as the previously mentioned Open
Access, or noisy Email Alerts, which might appear when texts are extracted from abstract web pages. The lexicon is
composed of frequent prefixes, and it is used to lowercase words before encoding the text. The prefixes are obtained
according to the procedure described in Section V in order to avoid lowercasing actual proper names. Effective
lexicons are domain specific, as word frequency is. Approximately fifty prefixes have been sufficient to correctly
extract all the authors from the 2350 item data set.



V. REMARKS 



Data Set. Author name and layout templates have been identified based on the title pages of 2350 scientific works.
They include a variety of journals and publishers, in addition to self-published drafts and preprints. Most of
these works have a single-block author layout. Approximately 400 works have a multi-block layout. The set was
curated from an initial set of 2600 works. Exclusion was due either to an extremely poor conversion to plain
text, or to the existence of infrequent patterns that would conflict with general templates. Disregarded
patterns are single-letter last names, four-word names, lack of separator among coauthors, and line breaks
between fore and last name.

Common Name Lexicon. To lowercase common words without conflicting with real proper names, the following
procedure has been devised. Given a list of author names and a list of common word candidates, the shortest
prefix of each candidate that is not an author prefix is recorded. In this way, a list of unique shortest prefixes is
obtained, together with the cumulative frequency of each. The most frequent prefixes are then included in the
lexicon. Note that the words low or water, even though they are frequent in chemical texts as common names,
are also proper names and, therefore, are not included. The overlap between common and proper names might
be huge in an international setting. Still, frequent scientific terms, such as functional or tetrahedron, appear
apart from proper names.

Additional Data and Software. Conversion of the title pages to plain text has been accomplished using Xpdf
[14]. Its pdftotext utility has been modified, setting the rasterization parameter maxIntraLineDelta to 0.2 in
order to eliminate possible author superscripts. Approximately one hundred thousand PubMed citations
have been processed to analyze title words and author names. Two lists, one with over sixty thousand unique
words and their frequencies, and another with over one hundred thousand unique fore and last names, have
been compiled. These two lists have been used to build the prefix lexicon, as described above. While no more
than fifty prefixes are required to correctly retrieve all the authors from the data set, for the sake of greater
generality the lexicon has been enlarged up to 450 entries in the current implementation. The procedure for
the author field extraction has been implemented in the Cb2Bib program [...], version 1.1.1, and it is part of its
set of recognition algorithms.
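The shortest-prefix procedure described in the Common Name Lexicon remark can be sketched as follows. This is a small Python illustration under stated assumptions, not the code used in the program: names and words are compared in lowercase, candidates that are entirely a prefix of some author name are skipped, and the toy names and frequencies below are invented for the example.

```python
from collections import Counter

def build_prefix_lexicon(author_names, word_freqs, size):
    """Return the `size` most frequent shortest prefixes that no author name shares."""
    author_prefixes = {name.lower()[:k]
                       for name in author_names
                       for k in range(1, len(name) + 1)}
    totals = Counter()
    for word, freq in word_freqs.items():
        w = word.lower()
        for k in range(1, len(w) + 1):
            if w[:k] not in author_prefixes:
                totals[w[:k]] += freq  # cumulative frequency per unique prefix
                break                  # only the shortest such prefix is recorded
    return [prefix for prefix, _ in totals.most_common(size)]

def lowercase_common(words, lexicon):
    """Lowercase the words that start with a lexicon prefix, before encoding."""
    prefixes = tuple(lexicon)
    return [w.lower() if w.lower().startswith(prefixes) else w for w in words]

# Toy data, for illustration only.
authors = ["Low", "Water", "Newton", "Funk", "Tesla"]
words = {"low": 50, "water": 40, "functional": 30, "functions": 20, "tetrahedron": 10}
lexicon = build_prefix_lexicon(authors, words, size=2)
print(lexicon)                                                         # ['func', 'tet']
print(lowercase_common(["Functional", "Newton", "Tetrahedron"], lexicon))
```

In this toy run, low and water produce no prefix at all because each is itself a prefix of an author name, which matches the exclusion noted in the remark.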